In this session we will use example datasets from single-cell RNA sequencing and spatial transcriptomics to demonstrate additional types of visualizations, including reduced dimension plots, spatial plots, and heatmaps.
We will use R packages from Bioconductor for the data processing and some of the visualizations.
To install packages from Bioconductor, first install the
BiocManager package from CRAN with
install.packages("BiocManager"), and then use the command
BiocManager::install().
library(tidyverse)
library(here)
library(ggplot2)
library(SingleCellExperiment)
library(SpatialExperiment)
library(STexampleData)
library(TENxPBMCData)
library(scater)
library(scran)
library(pheatmap)
library(ComplexHeatmap)
We load an example single-cell dataset from the
TENxPBMCData package. This package contains single-cell RNA
sequencing data from peripheral blood mononuclear cells (PBMCs) measured
with 10x Genomics platforms, and saved in
SingleCellExperiment format.
# load dataset
sce <- TENxPBMCData(dataset = "pbmc4k")
## see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
## loading from cache
sce
## class: SingleCellExperiment
## dim: 33694 4340
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
dim(sce)
## [1] 33694 4340
We run some steps from a short analysis workflow on this dataset. We skip several analysis steps here (in particular quality control), since we are mainly interested in using this dataset for demonstration purposes to generate several new types of visualizations.
For more details on the analysis workflow, see the Orchestrating Single-Cell Analysis with Bioconductor online book.
# preprocessing
rownames(sce) <- rowData(sce)$Symbol_TENx
# normalization
set.seed(123)
quickclus <- quickCluster(sce)
table(quickclus)
## quickclus
## 1 2 3 4 5 6 7 8 9 10 11
## 588 361 264 161 866 365 531 536 360 207 101
sce <- computeSumFactors(sce, cluster = quickclus)
sce <- logNormCounts(sce)
# variance modeling
set.seed(123)
dec <- modelGeneVarByPoisson(sce)
hvgs <- getTopHVGs(dec, prop = 0.1)
# dimensionality reduction
set.seed(123)
sce <- denoisePCA(sce, subset.row = hvgs, technical = dec)
set.seed(123)
sce <- runTSNE(sce, dimred = "PCA")
set.seed(123)
sce <- runUMAP(sce, dimred = "PCA")
# clustering
g <- buildSNNGraph(sce, k = 10, use.dimred = "PCA")
clus <- igraph::cluster_walktrap(g)$membership
table(clus)
## clus
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 663 610 249 353 166 322 993 391 165 47 156 95 78 23 29
colLabels(sce) <- factor(clus)
# differential expression
markers <- findMarkers(sce, pval.type = "some", direction = "up")
Here, we demonstrate several types of reduced dimension plots. These provide alternative ways to visualize structure in high-dimensional data.
To do this, we need to first create a data frame containing all the
variables to plot. These are stored in colData(sce) and
reducedDim(sce) in the SingleCellExperiment
object.
# create data frame
df <- cbind.data.frame(
colData(sce),
reducedDim(sce, "PCA"),
reducedDim(sce, "TSNE"),
reducedDim(sce, "UMAP")
)
head(df, 3)
## Sample Barcode Sequence Library Cell_ranger_version
## 1 pbmc4k AAACCTGAGAAGGCCT-1 AAACCTGAGAAGGCCT 1 v2.1
## 2 pbmc4k AAACCTGAGACAGACC-1 AAACCTGAGACAGACC 1 v2.1
## 3 pbmc4k AAACCTGAGATAGTCA-1 AAACCTGAGATAGTCA 1 v2.1
## Tissue_status Barcode_type Chemistry Sequence_platform Individual
## 1 <NA> Chromium Chromium_v2 Hiseq4000 HealthyDonor1
## 2 <NA> Chromium Chromium_v2 Hiseq4000 HealthyDonor1
## 3 <NA> Chromium Chromium_v2 Hiseq4000 HealthyDonor1
## Date_published sizeFactor label PC1 PC2 PC3 PC4
## 1 2017-11-08 0.4032459 6 17.04326 -0.2718908 -1.202488 -3.967078
## 2 2017-11-08 0.7397772 1 16.31126 -2.1582928 -5.494709 -1.364433
## 3 2017-11-08 0.4294928 6 20.69168 4.9046899 1.752722 -5.893432
## PC5 PC6 PC7 TSNE1 TSNE2 UMAP1 UMAP2
## 1 1.7250706 -0.6468792 0.6157505 -34.45390 4.498456 -13.82578 0.4271781
## 2 -0.9776328 -2.5657073 0.2732182 -23.05328 6.489842 -12.44451 -1.0040864
## 3 -1.5285852 3.5752710 3.4892835 -40.73745 4.826225 -14.42250 1.3044480
First, we generate a principal component analysis (PCA) plot showing
the first two principal components (PCs) using ggplot. We
plot clusters by color and add several additional formatting
options.
# PCA plot
ggplot(df, aes(x = PC1, y = PC2, color = label)) +
geom_point(alpha = 0.5, size = 0.5) +
ggtitle("PCA") +
theme_bw() +
theme(panel.grid = element_blank())
Alternatively, we can generate t-SNE or UMAP plots. The t-SNE and UMAP algorithms are commonly used in the single-cell field to visualize clustering results in high-dimensional data. These are relatively complicated non-linear algorithms, which tend to provide a better separation of clusters in a two-dimensional visualization. However, caution is also needed when interpreting the plots, since they can be misleading by introducing cluster separation where none exists, and by generating an arbitrary relative spatial arrangement of the clusters.
Observe that the clusters are more clearly separated in the UMAP plot than in the t-SNE plot, and that the relative spatial arrangement of the clusters differs between the two plots (and also differs between random seeds when running the algorithms).
# t-SNE plot
ggplot(df, aes(x = TSNE1, y = TSNE2, color = label)) +
geom_point(alpha = 0.5, size = 0.5) +
ggtitle("t-SNE") +
theme_bw() +
theme(panel.grid = element_blank())
# UMAP plot
ggplot(df, aes(x = UMAP1, y = UMAP2, color = label)) +
geom_point(alpha = 0.5, size = 0.5) +
ggtitle("UMAP") +
theme_bw() +
theme(panel.grid = element_blank())
Here, we demonstrate heatmaps as a way to visualize the top expressed marker genes per cluster.
First, we obtain the top marker genes for cluster 5 using some additional code continuing from the workflow above.
# top marker genes for cluster 5
top_markers <- markers[["5"]]
best <- top_markers[1:20, ]
lfcs <- getMarkerEffects(best)
Now, create a heatmap using the pheatmap package. This
shows marker genes in rows, clusters in columns, and log-fold-change
expression as color gradient. Rows and columns are automatically grouped
using additional clustering (which may either help the reader or
possibly mislead or overcomplicate things, depending on the
dataset).
# heatmap using pheatmap
pheatmap(lfcs, breaks = seq(-5, 5, length.out = 101))
An alternative package to create heatmaps is ComplexHeatmap, which provides extensive and powerful options for customized formatting, and includes detailed documentation and tutorials.
Here, we re-create the above heatmap using
ComplexHeatmap, along with additional formatting
options.
# heatmap using ComplexHeatmap
Heatmap(lfcs)
# ComplexHeatmap with additional formatting
Heatmap(
lfcs,
name = "log2-fold change",
column_title = "cluster",
column_title_side = "bottom",
row_title = "gene",
row_names_gp = gpar(fontface = "italic"),
rect_gp = gpar(col = "white", lwd = 1.5)
)
Alternatively, we can also use violin plots to visualize the top
marker genes per cluster. Below, we show an example using a plotting
function from the scater package.
This shows the same information as the heatmap above, represented using a different type of visualization. (Which is more effective for this dataset?)
# violin plots
plotExpression(sce, features = rownames(best), x = "label", color_by = "label")
In this example, we use a spatial transcriptomics dataset from the
STexampleData package. This is a spatial transcriptomics
dataset from the dorsolateral prefrontal cortex (DLPFC) region in a
postmortem human brain sample, measured with the 10x Genomics Visium
platform, and saved in SpatialExperiment format.
# load dataset
spe <- Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## loading from cache
spe
## class: SpatialExperiment
## dim: 33538 4992
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(3): gene_id gene_name feature_type
## colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
## TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
## colData names(7): barcode_id sample_id ... ground_truth cell_count
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(4): sample_id image_id data scaleFactor
dim(spe)
## [1] 33538 4992
We run some steps from a short analysis workflow to prepare the dataset. As above, we skip several analysis steps here, since we are using the dataset for demonstration purposes to generate visualizations only.
For more details on the analysis workflow, see the Best Practices for Spatial Transcriptomics Analysis with Bioconductor online book.
The ggspavis
package also provides plotting functions for spatial transcriptomics
data. Here, we use ggplot code instead, to show how the
plots are built up.
# preprocessing
spe <- spe[, colData(spe)$in_tissue == 1]
colnames(spatialCoords(spe)) <- c("x", "y")
colData(spe)$label <- colData(spe)$ground_truth
# normalization
spe <- logNormCounts(spe)
# feature selection
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)
## is_mito
## FALSE TRUE
## 33525 13
spe <- spe[!is_mito, ]
dec <- modelGeneVar(spe)
top_hvgs <- getTopHVGs(dec, prop = 0.1)
# dimensionality reduction
set.seed(123)
spe <- runPCA(spe, subset_row = top_hvgs)
set.seed(123)
spe <- runUMAP(spe, dimred = "PCA")
colnames(reducedDim(spe, "UMAP")) <- paste0("UMAP", 1:2)
For spatial data, we can generate a spatial plot showing annotation labels or expression intensity in x-y coordinates.
First, we create a data frame containing the data for each spot
(spatial measurement location), stored in colData(spe),
spatialCoords(spe), and reducedDim(spe) in the
SpatialExperiment object.
# set up data frame
df <- cbind.data.frame(
colData(spe),
spatialCoords(spe),
reducedDim(spe, "PCA"),
reducedDim(spe, "UMAP")
)
head(df, 2)
## barcode_id sample_id in_tissue array_row
## AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673 1 50
## AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673 1 3
## array_col ground_truth cell_count label sizeFactor x
## AAACAAGTATCTCCCA-1 102 Layer3 6 Layer3 1.8428941 9791
## AAACAATCTACTAGCA-1 43 Layer1 16 Layer1 0.3632188 5769
## y PC1 PC2 PC3 PC4 PC5
## AAACAAGTATCTCCCA-1 8468 4.854207 1.065363 -0.4216272 1.7875615 -0.2024489
## AAACAATCTACTAGCA-1 2807 1.295053 -4.155685 1.3642721 -0.1832911 -1.4907549
## PC6 PC7 PC8 PC9 PC10
## AAACAAGTATCTCCCA-1 -0.7078520 -1.019253 -0.4320465 -1.2705471 1.0381851
## AAACAATCTACTAGCA-1 -0.4808741 -2.390182 -2.6052694 0.7815192 -0.8981974
## PC11 PC12 PC13 PC14 PC15
## AAACAAGTATCTCCCA-1 0.3112108 -1.010108 -0.8010832 0.4172931 -0.5170250
## AAACAATCTACTAGCA-1 2.6791549 -1.956611 0.2053435 1.6410917 0.6911729
## PC16 PC17 PC18 PC19 PC20
## AAACAAGTATCTCCCA-1 -0.2637965 -0.724249 0.2981517 -0.6003563 0.0831525
## AAACAATCTACTAGCA-1 -0.9726718 5.115035 -1.3021309 1.4019829 -0.3443739
## PC21 PC22 PC23 PC24 PC25
## AAACAAGTATCTCCCA-1 -0.632314 -0.006210057 0.3180925 0.6784704 -0.31130902
## AAACAATCTACTAGCA-1 1.192591 -1.471208681 0.1300177 -2.1604481 -0.09396313
## PC26 PC27 PC28 PC29 PC30
## AAACAAGTATCTCCCA-1 -0.04425171 -0.3914127 1.0541277 -1.210250 0.1229272
## AAACAATCTACTAGCA-1 0.14986324 0.3577853 -0.1819574 -1.830997 -0.1061296
## PC31 PC32 PC33 PC34 PC35
## AAACAAGTATCTCCCA-1 0.6136811 0.06691078 -0.1427879 0.2381235 -0.1877092
## AAACAATCTACTAGCA-1 0.2324755 1.14920270 -1.1348378 -0.5429243 0.7826222
## PC36 PC37 PC38 PC39 PC40
## AAACAAGTATCTCCCA-1 0.2447371 -1.145837 -0.51591009 -0.4117514 -1.0601292
## AAACAATCTACTAGCA-1 -2.3139240 2.190438 -0.04699769 0.9157029 -0.7106704
## PC41 PC42 PC43 PC44 PC45
## AAACAAGTATCTCCCA-1 -0.7755413 0.51368093 0.2211477 0.2450753 0.06177133
## AAACAATCTACTAGCA-1 0.9319064 -0.04174465 -0.5759469 -3.6146601 0.42623415
## PC46 PC47 PC48 PC49 PC50 UMAP1
## AAACAAGTATCTCCCA-1 0.5209681 0.7982671 -0.0383644 -0.0163750 0.466978 1.622433
## AAACAATCTACTAGCA-1 3.2290276 3.9403996 -0.5511797 -0.8657158 2.520766 1.930111
## UMAP2
## AAACAAGTATCTCCCA-1 -1.7603560
## AAACAATCTACTAGCA-1 -0.8818074
Now, we can create a spatial plot showing manually annotated reference labels in x-y coordinates. We also include several plot formatting options.
# color palette
pal <- c("#F0027F", "#377EB8", "#4DAF4A", "#984EA3",
"#FFD700", "#FF7F00", "#1A1A1A", "#666666")
# spatial plot showing annotation labels
ggplot(df, aes(x = x, y = y, color = label)) +
geom_point(size = 0.6) +
scale_color_manual(values = pal) +
coord_fixed() +
scale_y_reverse() +
ggtitle("Manually annotated") +
theme_bw() +
theme(panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
guides(color = guide_legend(override.aes = list(size = 2)))
Alternatively, create a plot showing expression intensity of a specific gene of interest in x-y coordinates.
# set up data frame
my_gene <- "MOBP"
ix <- which(rowData(spe)$gene_name == my_gene)
df <- cbind(df, gene = counts(spe)[ix, ])
# spatial plot showing gene expression
ggplot(df, aes(x = x, y = y, color = gene)) +
geom_point(size = 0.6) +
scale_color_gradient(low = "gray85", high = "blue") +
coord_fixed() +
scale_y_reverse() +
ggtitle(my_gene) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
We can also generate reduced dimension plots for this dataset to further investigate the annotation labels or a gene of interest.
# PCA plot showing annotation labels
ggplot(df, aes(x = PC1, y = PC2, color = label)) +
geom_point(alpha = 0.5, size = 0.5) +
ggtitle("PCA") +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(color = guide_legend(override.aes = list(size = 2)))
# UMAP plot showing annotation labels
ggplot(df, aes(x = UMAP1, y = UMAP2, color = label)) +
geom_point(alpha = 0.5, size = 0.5) +
ggtitle("UMAP") +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(color = guide_legend(override.aes = list(size = 2)))
# UMAP plot showing gene expression
ggplot(df, aes(x = UMAP1, y = UMAP2, color = gene)) +
geom_point(alpha = 0.5, size = 0.5) +
scale_color_gradient(low = "gray85", high = "blue") +
ggtitle(my_gene) +
theme_bw() +
theme(panel.grid = element_blank())